class: center, middle, inverse, title-slide

.title[
# Simple linear regression: assumptions
]
.subtitle[
## Lecture 2
]
.author[
### Manuel Villarreal
]
.date[
### 08/28/24
]

---

### OLS Estimators

- From last class we know that the OLS estimators are the solution to our decision problem, in which we want to find candidate values for `\(\beta^*_0\)` and `\(\beta^*_1\)` that minimize the distance between the "predicted" line and our observations.

--

- In our blood pressure example, that means finding an estimate of the blood pressure of a person who is "0 years old" ( `\(\beta_0\)` ),

--

- and finding an estimate of the difference in blood pressure between people who are one year of age apart ( `\(\beta_1\)` ):

--

`$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}_1$$`

`$$\hat{\beta}_1 = \frac{\sum_{i = 1}^n(y_i - \bar{y})(x_{i1} - \bar{x}_1)}{\sum_{i = 1}^n(x_{i1}-\bar{x}_1)^2}$$`

---

### Assumptions

In classical linear regression there are 5 assumptions that allow us to make inferences about the values of the parameters in the model using the "classical" approach without worries.

--

1. Errors are centered around 0: `\(\mathrm{E}(\epsilon_i) = 0\)`.

--

1. Constant variance: `\(\mathrm{Var}(\epsilon_i) = \sigma^2\ \text{for all}\ i = 1, 2, \dots, n\)`.

--

1. Independence of errors: `\(\epsilon_1, \epsilon_2, \dots, \epsilon_n\)` are mutually independent.

--

1. Identically distributed errors: `\(\epsilon_i \sim \left(0, \sigma^2\right)\)`.

--

Additionally, for now we will assume:

- Normally distributed errors: `\(\epsilon_i \overset{iid}{\sim} Normal\left(0, \sigma^2\right)\)`

---

### What does this mean?

Some consequences of these assumptions are that:

--

1. `\(\mathrm{E} \left(Y_i \mid X_i \right) = \beta_0 + \beta_1X_{i}\)`

--

1. `\(\mathrm{Var}\left(Y_i \mid X_i \right) = \mathrm{Var}\left(\epsilon_i\right) = \sigma^2\)`

--

1. The OLS estimators `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` are optimal (they are the best linear unbiased estimators).
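As a quick sanity check, the closed-form formulas for `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` can be computed by hand in R and compared with the estimates returned by `lm()`. The data below are simulated for illustration only (they are not the blood pressure data from class):

```r
# Illustrative simulated data (not the blood pressure data from class)
set.seed(1)
n   <- 50
age <- runif(n, 20, 70)
bp  <- 100 + 0.5 * age + rnorm(n, sd = 5)

# Closed-form OLS estimates from the formulas on the slide
b1_hat <- sum((bp - mean(bp)) * (age - mean(age))) / sum((age - mean(age))^2)
b0_hat <- mean(bp) - b1_hat * mean(age)

# These match the estimates returned by lm()
coef(lm(bp ~ age))
c(b0_hat, b1_hat)
```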
--

Now, if we have `\(\epsilon_i \overset{iid}{\sim} Normal\left(0, \sigma^2\right)\)`, then we know that:

`$$\hat{\beta}_0 \sim N\left(\beta_0,\ \frac{\sigma^2\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2}\right)\quad,\quad \hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$`

---

### What happens if we lose the normality assumption?

As long as our errors `\(\epsilon_i\)` have an expectation equal to 0 (**assumption 1**), constant variance (**assumption 2**), are independent (**assumption 3**), and are identically distributed (**assumption 4**),

--

then by the Central Limit Theorem (**aka CLT**) we have that:

`$$\hat{\beta}_0 \overset{\bullet}{\sim} N\left(\beta_0,\ \frac{\sigma^2\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2}\right)\quad,\quad \hat{\beta}_1 \overset{\bullet}{\sim} N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$`

--

- This means that our estimators follow a Normal distribution **approximately**. In other words, although our confidence intervals might not have exactly the correct length and our **p-values** will not be exactly right, they will be **pretty close!**

---

class: middle, center, inverse

### Let's do a simulation study!

Open a new R file in your project and we will do this one together.

---

### How do we test the model assumptions?

- The short answer is... we can't. However, we can use visualization methods that will indicate to us if something is wrong.

--

- First we will need to calculate the model errors, which can be obtained as:

`$$\hat{\epsilon}_i = y_i - \hat{\mu}_i$$`

--

- These errors are known as **residuals**, and we can use them to visually inspect whether the assumptions of the model are "correct".
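In R, the residuals defined above can be computed directly as observed minus fitted values, or extracted with `resid()`. A minimal sketch, again using illustrative simulated data:

```r
# Illustrative simulated data and fitted model
set.seed(1)
age <- runif(50, 20, 70)
bp  <- 100 + 0.5 * age + rnorm(50, sd = 5)
fit <- lm(bp ~ age)

# Residuals by the definition above: observed minus fitted
mu_hat  <- fitted(fit)
eps_hat <- bp - mu_hat

# Same values as the built-in extractor
all.equal(unname(eps_hat), unname(resid(fit)))
```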
--

- We will go back to our blood pressure example, where we defined the model:

`$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \epsilon_i \quad \text{and} \quad \hat{\mu}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{age}_i$$`

---

### Residual plots

- The histogram of the residuals should be approximately centered at zero and symmetric.

--

- A qq-plot is a particular class of scatter plot that takes the sample quantiles of our observations and compares them to the theoretical quantiles of a distribution, in this case the standard normal.

--

- We use scatter plots to compare the residuals against different independent variables, for example the observation number (order).

--

- Auto-correlation plots check that there is no correlation between residuals across time, for example that residual `\(n\)` is not correlated with residual `\(n+1\)`.

--

- All of these methods can help us evaluate whether the **simple linear model** is a good approach for the data that we have.
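The four diagnostic plots described above can all be produced with base R graphics. A sketch, using simulated data in place of the blood pressure data:

```r
# Illustrative simulated data and fitted model
set.seed(1)
age <- runif(50, 20, 70)
bp  <- 100 + 0.5 * age + rnorm(50, sd = 5)
fit <- lm(bp ~ age)
e   <- resid(fit)

par(mfrow = c(2, 2))                        # 2 x 2 grid of plots
hist(e, main = "Histogram of residuals")    # centered at 0 and symmetric?
qqnorm(e); qqline(e)                        # points close to the line?
plot(e, xlab = "Observation order",
     ylab = "Residual")                     # no pattern across order?
acf(e, main = "Residual autocorrelation")   # spikes inside the bands?
```

`plot(fit)` also produces a standard set of regression diagnostics, but building the plots by hand makes it clear what each one is checking.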